Tutorial 3: Building and Working with DataFrames

Author

Danet and Becks, based on originals by Delmas and Griffiths

Published

February 28, 2023

Working with vectors, arrays and matrices is important. But quite often, we want to collect high-dimensional data (multiple variables) from our simulations and store them in a spreadsheet-type format.

As you’ve seen in Tutorial 1, there are plotting macros (@df) within the StatsPlots package that allow us to work with data frame objects from the DataFrames package. A second benefit of the data frame object is that we can export it as a csv file and import this into R where we may prefer working on plotting and statistics.

To this end, here we will also introduce the CSV package, which is very handy for exporting DataFrame objects to csv files, and importing them as well, if you’d like.

The Data Frame

To initialise a dataframe you use the DataFrame function from the DataFrames package:

using DataFrames
dat = DataFrame(col1=[], col2=[], col3=[]) # we use [] to specify an empty column of any type and size.
0×3 DataFrame
 Row │ col1  col2  col3
     │ Any   Any   Any
─────┴──────────────────

Alternatively, you can specify the data type for each column.

dat1 = DataFrame(col1=Float64[], col2=Int64[], col3=Float64[])
0×3 DataFrame
 Row │ col1     col2   col3
     │ Float64  Int64  Float64
─────┴─────────────────────────
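
To check which element type each column ended up with, one option is to ask for the eltype of every column. The following is a minimal sketch; the names untyped and typed are only for illustration and are not part of the tutorial's data.

```julia
using DataFrames

# Untyped columns accept any value; typed columns only accept values of that type
untyped = DataFrame(col1=[], col2=[])
typed = DataFrame(col1=Float64[], col2=Int64[])

# eachcol iterates over the columns; eltype gives the element type of each one
untyped_types = eltype.(eachcol(untyped))
typed_types = eltype.(eachcol(typed))

println(untyped_types)
println(typed_types)
```

Typed columns are usually worth the extra characters: Julia can work with them more efficiently, and a value of the wrong type is caught when you try to add it.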

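Of course, you are not limited to labels like col1. Variable names are important, and most of the conventions we use in R carry over to Julia: a_b or AaBa work, but a b does not (no spaces allowed). One difference from R is that a.b is not a valid name in the keyword syntax used here, because Julia reads the dot as field access.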

# provide informative column titles using:
dat2 = DataFrame(species=[], size=[], rate=[])
0×3 DataFrame
 Row │ species  size  rate
     │ Any      Any   Any
─────┴─────────────────────
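
The names function returns the column names of a data frame, which is a quick way to check what you have created. The sketch below also shows the pair syntax, where names are given as strings and can therefore contain spaces (possible, but awkward to work with). The data frames traits and spaced are hypothetical examples.

```julia
using DataFrames

# Keyword syntax: column names must be valid Julia identifiers
traits = DataFrame(species=String[], body_size=Float64[])

# Pair syntax: column names are strings, so a space is allowed
spaced = DataFrame("body size" => Float64[])

println(names(traits))
println(names(spaced))
```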

Adding Data to a Data Frame

To add data to a dataframe, we use the push! command.

species = "D. magna"
size = 2.2
rate = 4.2
4.2
# push!() arguments: data frame, data
push!(dat2, [species, size, rate])
1×3 DataFrame
 Row │ species   size  rate
     │ Any       Any   Any
─────┼──────────────────────
   1 │ D. magna  2.2   4.2

Because push!() modifies the data frame in place, calling it again appends another row to the existing data. Note that push! adds only one row at a time. But since Julia is so good with loops (compared to R), this makes filling a data frame row by row really easy, and we’ll learn how to do this in the next tutorial.

species2 = "D.pulex"
size2 = 1.8
rate2 = 3.1

# push!() arguments: data frame, data
push!(dat2, [species2, size2, rate2])
2×3 DataFrame
 Row │ species   size  rate
     │ Any       Any   Any
─────┼──────────────────────
   1 │ D. magna  2.2   4.2
   2 │ D.pulex   1.8   3.1
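
A row does not have to be a vector. As a small sketch, assuming a data frame with typed columns (the name growth is hypothetical), push! also accepts a tuple with values in column order, or a NamedTuple whose values are matched to columns by name.

```julia
using DataFrames

growth = DataFrame(species=String[], size=Float64[], rate=Float64[])

# A tuple (or vector) supplies the values in column order
push!(growth, ("D. magna", 2.2, 4.2))

# A NamedTuple is matched to columns by name, so the order does not matter
push!(growth, (species="D. pulex", rate=3.1, size=1.8))

println(growth)
println(nrow(growth))
```

The NamedTuple form is a little longer, but it protects against accidentally swapping two numeric values such as size and rate.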

Helper Functions for Data Frames

You can print data frames using println:

println(dat2)
2×3 DataFrame
 Row │ species   size  rate 
     │ Any       Any   Any  
─────┼──────────────────────
   1 │ D. magna  2.2   4.2
   2 │ D.pulex   1.8   3.1

The first and last functions work like head and tail in R and elsewhere: the first argument is the data frame and the second argument is the number of rows to return.

first(dat2, 2)
2×3 DataFrame
 Row │ species   size  rate
     │ Any       Any   Any
─────┼──────────────────────
   1 │ D. magna  2.2   4.2
   2 │ D.pulex   1.8   3.1
last(dat2,2)
2×3 DataFrame
 Row │ species   size  rate
     │ Any       Any   Any
─────┼──────────────────────
   1 │ D. magna  2.2   4.2
   2 │ D.pulex   1.8   3.1
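
With only two rows, first and last return the same thing. The difference is easier to see on a longer data frame. Here is a minimal sketch using made-up numbers (the data frame counts and its values are hypothetical).

```julia
using DataFrames

# Hypothetical data frame with five rows
counts = DataFrame(day=1:5, count=[10, 12, 15, 19, 24])

head_rows = first(counts, 2)   # the rows for days 1 and 2
tail_rows = last(counts, 2)    # the rows for days 4 and 5

println(head_rows)
println(tail_rows)
```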

And as we learned with matrices and arrays, the [row, column] method also works for data frames:

dat2[1,2]
2.2
dat2[1,:]
DataFrameRow (3 columns)
 Row │ species   size  rate
     │ Any       Any   Any
─────┼──────────────────────
   1 │ D. magna  2.2   4.2
dat2[:,3]
2-element Vector{Any}:
 4.2
 3.1
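
Columns can also be selected by name rather than by position, which tends to be easier to read. The sketch below rebuilds the two-row Daphnia data under the hypothetical name daphnia so that it runs on its own.

```julia
using DataFrames

daphnia = DataFrame(species=["D. magna", "D.pulex"], size=[2.2, 1.8], rate=[4.2, 3.1])

by_position = daphnia[1, 2]     # row 1, column 2
by_name = daphnia[1, :size]     # the same cell, with the column chosen by name
rate_copy = daphnia[:, :rate]   # a copy of the whole rate column
rate_col = daphnia.rate         # the stored column itself (no copy)

println(by_name)
println(rate_copy)
```

Using : for the rows returns a copy, so changing rate_copy leaves the data frame untouched, whereas changes to rate_col would show up in daphnia.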

The CSV

As with R, there are functions to read and write .csv files to and from dataframes. This makes interoperability with tools in R and standard data storage file formats easy.

To write our daphnia data to a csv file, we use a familiar syntax, but a function from the CSV package.

using CSV
CSV.write("daphniadata.csv", dat2)

Of course, you can read files in using, yes, CSV.read. Note that the second argument tells CSV.read to return the data as a data frame.

daph_in = CSV.read("betterDaphniaData.csv", DataFrame)
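
To see writing and reading working together, here is a minimal round-trip sketch. It assumes the CSV and DataFrames packages are installed, writes to a temporary directory rather than your working directory, and uses the hypothetical file name daphnia_roundtrip.csv.

```julia
using CSV, DataFrames

daphnia = DataFrame(species=["D. magna", "D.pulex"], size=[2.2, 1.8], rate=[4.2, 3.1])

# Write to a temporary directory so the example leaves your working directory alone
path = joinpath(mktempdir(), "daphnia_roundtrip.csv")
CSV.write(path, daphnia)

# Read the file back; CSV.read infers the column types from the file contents
daphnia_back = CSV.read(path, DataFrame)
println(daphnia_back)
```

Because a csv file stores only text, the column types are inferred again on reading, so it is worth checking them after an import.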